Search CORE

15 research outputs found

Dynamic Graphs on the GPU

Author: Ashkiani Saman
Awad Muhammad A.
Owens John D.
Porumbescu Serban D.
Publication venue: eScholarship, University of California
Publication date: 01/05/2020
Field of study

We present a fast dynamic graph data structure for the GPU. Our dynamic graph structure uses one hash table per vertex to store adjacency lists and achieves 3.4–14.8x faster insertion rates over the state of the art across a diverse set of large datasets, as well as deletion speedups up to 7.8x. The data structure supports queries and dynamic updates through both edge and vertex insertion and deletion. In addition, we define a comprehensive evaluation strategy based on operations, workloads, and applications that we believe better characterize and evaluate dynamic graph data structures

Crossref

eScholarship - University of California

GPU LSM: A Dynamic Dictionary Data Structure for the GPU

Author: Amenta Nina
Ashkiani Saman
Farach-Colton Martin
Li Shengren
Owens John D.
Publication venue
Publication date: 01/01/2018
Field of study

We develop a dynamic dictionary data structure for the GPU, supporting fast insertions and deletions, based on the Log Structured Merge tree (LSM). Our implementation on an NVIDIA K40c GPU has an average update (insertion or deletion) rate of 225 M elements/s, 13.5x faster than merging items into a sorted array. The GPU LSM supports the retrieval operations of lookup, count, and range query operations with an average rate of 75 M, 32 M and 23 M queries/s respectively. The trade-off for the dynamic updates is that the sorted array is almost twice as fast on retrievals. We believe that our GPU LSM is the first dynamic general-purpose dictionary data structure for the GPU.Comment: 11 pages, accepted to appear on the Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS'18

arXiv.org e-Print Archive

eScholarship - University of California

Parallel Algorithms and Dynamic Data Structures on the Graphics Processing Unit: a warp-centric approach

Author: Ashkiani Saman
Publication venue: eScholarship, University of California
Publication date: 01/12/2017
Field of study

Graphics Processing Units (GPUs) are massively parallel processors with thousands of active threads originally designed for throughput-oriented tasks.In order to get as much performance as possible given the hardware characteristics of GPUs, it is extremely important for programmers to not only design an efficient algorithm with good enough asymptotic complexities, but also to take into account the hardware limitations and preferences.In this work, we focus our design on two high level abstractions: work assignment and processing. The former denotes the assigned task by the programmer to each thread or group of threads. The latter encapsulates the actual execution of assigned tasks.Previous work conflates work assignment and processing into similar granularities. The most traditional way is to have per-thread work assignment followed by per-thread processing of that assigned work. Each thread sequentially processes a part of input and then the results are combined appropriately. In this work, we use this approach in implementing various algorithms for the string matching problem (finding all instances of a pattern within a larger text).Another effective but less popular idea is per-warp work assignment followed by per-warp processing of that work. It usually requires efficient intra-warp communication to be able to efficiently process input data which is now distributed among all threads within that warp. With the emergence of warp-wide voting and shuffle instructions, this approach has gained more potential in solving particular problems efficiently and with some benefits compared to the per-thread assignment and processing. In this work, we use this approach to implement a series of parallel algorithms: histogram, multisplit and radix sort.An advantage of using similar granularities for work assignment and processing is in problems with uniform per-thread or per-warp workloads, where it is quite easy to adapt warp-synchronous ideas and achieve high performance.However, with non-uniform irregular workloads, different threads might finish their processing in different times which can cause a sub-par performance. This is mainly because the whole warp continues to be resident in the device as long as all its threads are finished.With these irregular tasks in mind, we propose to use different granularities for our work assignment and processing.We use a per-thread work assignment followed by a per-warp processing; each thread is still responsible for an independent task, but now all threads within a warp cooperate with each other to perform all these tasks together, one at a time, until all are successfully processed.Using this strategy, we design a dynamic hash table for the GPU, the slab hash, which is a totally concurrent data structure supporting asynchronous updates and search queries: threads may have different operations to perform and each might require an unknown amount of time to be fulfilled. By following our warp-cooperative strategy, all threads help each other perform these operations together, causing a much higher warp efficiency compared to traditional conflated work assignment and processing schemes

Ezid

eScholarship - University of California

Engineering a High-Performance GPU B-Tree

Author: Ashkiani Saman
Ashkiani Saman
ElTantawy A.
Fix Jordan
Geil Afton
Huang Yulong
Kaczmarski Krzysztof
Zhan Yang
Publication venue: eScholarship, University of California
Publication date: 01/02/2019
Field of study

We engineer a GPU implementation of a B-Tree that supports concurrent queries (point, range, and successor) and updates (insertions and deletions). Our B-tree outperforms the state of the art, a GPU log-structured merge tree (LSM) and a GPU sorted array. In particular, point and range queries are significantly faster than in a GPU LSM (the GPU LSM does not implement successor queries). Furthermore, B-Tree insertions are also faster than LSM and sorted array insertions unless insertions come in batches of more than roughly 100k. Because we cache the upper levels of the tree, we achieve lookup throughput that exceeds the DRAM bandwidth of the GPU. We demonstrate that the key limiter of performance on a GPU is contention and describe the design choices that allow us to achieve this high performance

Crossref

eScholarship - University of California

Recommended from our members

Parallel Approaches to the String Matching Problem on the GPU

Author: Amenta Nina
Ashkiani Saman
Owens John D
Publication venue: eScholarship, University of California
Publication date: 19/05/2016
Field of study

We design a family of parallel algorithms and GPU implementations for the exact string matching problem, based on Rabin-Karp (RK) randomized string matching. We describe and analyze three primary parallel approaches to binary string matching: cooperative (CRK), divide-and-conquer (DRK), and a novel hybrid of both (HRK). The CRK is most effective for large patterns (>8K characters), while the DRK approach is superior for shorter patterns.We then generalize the DRK to support any alphabet size without loss of performance. Our DRK method achieves up to a 64 GB/s processing rate on 8-character patterns from an 8-bit alphabet on an NVIDIA Tesla K40c GPU.We next demonstrate a novel parallel two-stage matching method (DRK-2S), which first skims the text for a smaller subset of the pattern and then verifies all potential matches in parallel.Our DRK-2S method is superior for pattern sizes up to 64k compared to the fastest CPU-based string matching implementations. With an 8-bit alphabet and up to 1k-character patterns, we get a geometric mean speedup of 4.81x against the best CPU methods, and can achieve a processing rate of at least 53 GB/s

eScholarship - University of California

Error correction via smoothed L0-norm recovery

Author: Ashkiani Saman
Babaie-Zadeh Massoud
Jutten Christian
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 28/06/2011
Field of study

International audienceChannel coding has been considered as a classical approach to overcome corruptions occurring in some elements of input signal which may lead to loss of some information. Proper redundancies are added to the input signal to improve the capability of detecting or even correcting the corrupted signal. A similar scenario may happen dealing with real-field numbers rather than finite-fields. This paper considers a way to reconstruct an exact version of a corrupted signal by using an encoded signal with proper number of redundancies. The proposed algorithm uses Graduated Non-Convexity method beside using a smoothed function instead of 0-norm to correct all the corrupted elements. Simulations show that our proposed algorithm substantially improves the probability of exact recovery in comparison to previous algorithms

Hal - Université Grenoble Alpes

A Dynamic Hash Table for the GPU

Author: Ashkiani Saman
Farach-Colton Martin
Owens John D
Publication venue: eScholarship, University of California
Publication date: 01/01/2018
Field of study

We design and implement a fully concurrent dynamic hash table for GPUs with comparable performance to the state of the art static hash tables. We propose a warp-cooperative work sharing strategy that reduces branch divergence and provides an efficient alternative to the traditional way of per-thread (or per-warp) work assignment and processing. By using this strategy, we build a dynamic non-blocking concurrent linked list, the slab list, that supports asynchronous, concurrent updates (insertions and deletions) as well as search queries. We use the slab list to implement a dynamic hash table with chaining (the slab hash). On an NVIDIA Tesla K40c GPU, the slab hash performs updates with up to 512 M updates/s and processes search queries with up to 937 M queries/s. We also design a warp-synchronous dynamic memory allocator, SlabAlloc, that suits the high performance needs of the slab hash. SlabAlloc dynamically allocates memory at a rate of 600 M allocations/s, which is up to 37x faster than alternative methods in similar scenarios.Comment: 11 pages, accepted to appear on the Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018

arXiv.org e-Print Archive

Crossref

eScholarship - University of California

Recommended from our members

GPU Multisplit

Author: Ashkiani Saman
Davidson Andrew
Meyer Ulrich
Owens John D
Publication venue: eScholarship, University of California
Publication date: 19/05/2016
Field of study

Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer.Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort. However, sort does more work than necessary to implement multisplit, and is thus inefficient.In this work, we provide a parallel model and multiple implementations for the multisplit problem. Our principal focus is multisplit for a small number of buckets.In our implementations, we exploit the computational hierarchy of the GPU to perform most of the work locally, with minimal usage of global operations.We also use warp-synchronous programming models to avoid branch divergence and reduce memory usage, as well as hierarchical reordering of input elements to achieve better coalescing of global memory accesses.On an NVIDIA K40c GPU, for key-only (key-value) multisplit, we demonstrate a 3.0--6.7x (4.4--8.0x) speedup over radix sort, and achieve a peak throughput of 10.0 G keys/s

eScholarship - University of California